Search | VHL Regional Portal

1.

Concentration of inverted repeats along human DNA.

Bastos, Carlos A C; Afreixo, Vera; Rodrigues, João M O S; Pinho, Armando J.

J Integr Bioinform ; 20(2), 2023 Jun 01.

Article in English | MEDLINE | ID: mdl-37486620

ABSTRACT

This work aims to describe the observed enrichment of inverted repeats in the human genome, and to identify and describe, with detailed length profiles, the regions with significant and relevant enrichment of inverted repeats. The enrichment is assessed and tested with a recently proposed z-score-based measure. We simulate a genome using an order-7 Markov model trained with data from the real genome. The simulated genome is used to establish the critical values, which serve as decision thresholds to identify the regions with significantly enriched concentrations. Several human genome regions are highly enriched in inverted repeats, and this is observed in all human chromosomes. The distribution of inverted repeat lengths varies along the genome. The majority of the regions with severely exaggerated enrichment contain mainly short inverted repeats. There are also regions with regular peaks along the inverted repeat length distribution (periodic regularities) and, less frequently, regions with exaggerated enrichment for long lengths. However, adjacent regions tend to have similar distributions.
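The simulation step mentioned in this abstract (a Markov model trained on a real sequence and then sampled) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `train_markov` and `simulate`, the uniform fallback for unseen contexts, and the choice of starting context are hypothetical and are not taken from the paper, which uses order 7 on the full human genome.

```python
import random
from collections import defaultdict, Counter

def train_markov(seq, k):
    """Count which symbol follows each length-k context in seq."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def simulate(counts, k, length, seed=0, alphabet="ACGT"):
    """Sample a sequence from the order-k transition counts."""
    rng = random.Random(seed)
    # Assumption: start from a randomly chosen context seen in training.
    out = list(rng.choice(list(counts)))
    while len(out) < length:
        successors = counts.get("".join(out[-k:]))
        if successors:
            symbols, weights = zip(*successors.items())
            out.append(rng.choices(symbols, weights=weights)[0])
        else:
            # Assumption: unseen context falls back to a uniform draw.
            out.append(rng.choice(alphabet))
    return "".join(out[:length])

model = train_markov("ACGTACGTACGT", 2)
print(simulate(model, 2, 20))
```

For a real genome one would call `train_markov(genome, 7)`; the simulated sequence can then be scanned with the same statistic as the real one to obtain empirical thresholds.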

2.

Distribution of Distances Between Symmetric Words in the Human Genome: Analysis of Regular Peaks.

Bastos, Carlos A C; Afreixo, Vera; Rodrigues, João M O S; Pinho, Armando J; Silva, Raquel M.

Interdiscip Sci ; 11(3): 367-372, 2019 Sep.

Article in English | MEDLINE | ID: mdl-30911903

ABSTRACT

Finding DNA sites with high potential for the formation of hairpin/cruciform structures is an important task. Previous works studied the distances between adjacent reversed complement words (symmetric word pairs) and also for non-adjacent words. It was observed that for some words a few distances were favoured (peaks) and that in some distributions there was strong peak regularity. The present work extends previous studies, by improving the detection and characterization of peak regularities in the symmetric word pairs distance distributions of the human genome. This work also analyzes the location of the sequences that originate the observed strong peak periodicity in the distance distribution. The results obtained in this work may indicate genomic sites with potential for the formation of hairpin/cruciform structures.

Subjects

DNA/chemistry , Human Genome , Algorithms , Human Chromosomes , Genetic Databases , Genomics , Humans , Genetic Models , Nucleic Acid Conformation , DNA Sequence Analysis/methods , Software

3.

DNA word analysis based on the distribution of the distances between symmetric words.

Tavares, Ana H M P; Pinho, Armando J; Silva, Raquel M; Rodrigues, João M O S; Bastos, Carlos A C; Ferreira, Paulo J S G; Afreixo, Vera.

Sci Rep ; 7(1): 728, 2017 Apr 07.

Article in English | MEDLINE | ID: mdl-28389642

ABSTRACT

We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented. For this purpose, we developed new procedures to identify symmetric word pairs with uncommon empirical distance distribution and with clusters of overrepresented short distances. We speculate that patterns of overrepresentation of short distances between symmetric word pairs may allow the occurrence of non-standard DNA conformations, such as hairpin/cruciform structures. We focused on the human genome, and analysed both the complete genome as well as a version with known repetitive sequences masked out. We reported several well-defined features in the distributions of distances, which can be classified into three different profiles, showing enrichment in distinct distance ranges. We analysed in greater detail certain pairs of symmetric words of length seven, found by our procedure, characterised by the surprising fact that they occur at single distances more frequently than expected.
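A rough sketch of how a distance distribution between a word and its reversed complement might be collected is given below. The distance convention used here (start position to start position, nearest following occurrence of the reversed complement) and the function names are assumptions made for illustration; the abstract does not spell out the exact definition used by the authors.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reversed complement of a DNA word, e.g. AAC -> GTT."""
    return word.translate(COMPLEMENT)[::-1]

def find_all(seq, word):
    """Start positions of every (possibly overlapping) occurrence of word."""
    positions, start = [], seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions

def symmetric_pair_distances(seq, word):
    """Histogram of distances from each occurrence of word to the
    nearest following occurrence of its reversed complement."""
    rc_positions = find_all(seq, reverse_complement(word))
    hist = Counter()
    for p in find_all(seq, word):
        following = [q for q in rc_positions if q > p]
        if following:
            hist[following[0] - p] += 1
    return hist

print(symmetric_pair_distances("AACGTTAACCCGTT", "AAC"))
```

Comparing such a histogram with the one obtained from a random or simulated sequence is one way to look for overrepresented distances.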

Subjects

DNA , Human Genome , Genomics , DNA Sequence Analysis , Algorithms , Human Chromosomes , DNA/chemistry , DNA/genetics , Genetic Databases , Genomics/methods , Humans , Markov Chains , Genetic Models , Nucleic Acid Conformation , DNA Sequence Analysis/methods , Structure-Activity Relationship

4.

Exceptional Symmetry by Genomic Word: A Statistical Analysis.

Afreixo, Vera; Rodrigues, João M O S; Bastos, Carlos A C; Tavares, Ana H M P; Silva, Raquel M.

Interdiscip Sci ; 9(1): 14-23, 2017 Mar.

Article in English | MEDLINE | ID: mdl-27866321

ABSTRACT

Single-strand DNA symmetry has been pointed to as a universal law observed in the genomes of all living organisms. It is a somewhat broadly defined concept, which has been refined into some more specific measurable effects. Here we discuss the exceptional symmetry effect. Exceptional symmetry is the symmetry effect beyond that expected in independence contexts, and it can be measured for each word, for each equivalent composition group, or globally, combining the effects of all possible words of a given length. Global exceptional symmetry was found in several species, but there are genomic words with no exceptional symmetry effect, whereas others show a very high exceptional symmetry effect. In this work, we discuss a measure to evaluate the exceptional symmetry effect by symmetric word pair, and compare it with others. We present a detailed study of the exceptional symmetry by symmetric pairs and take the CG content into account. We also introduce and discuss the exceptional symmetry profile for the DNA of each organism, and we perform a multiple comparison for 31 genomes: 7 viruses; 5 archaea; 5 bacteria; 14 eukaryotes.

Subjects

Genomics/methods , Genetic Models , Statistics as Topic/methods , Single-Stranded DNA/genetics

5.

The exceptional genomic word symmetry along DNA sequences.

Afreixo, Vera; Rodrigues, João M O S; Bastos, Carlos A C; Silva, Raquel M.

BMC Bioinformatics ; 17: 59, 2016 Feb 03.

Article in English | MEDLINE | ID: mdl-26842742

ABSTRACT

BACKGROUND: The second Chargaff's parity rule and its extensions are recognized as universal phenomena in DNA sequences. However, parity of the frequencies of reverse complementary oligonucleotides could be a mere consequence of the single nucleotide parity rule, if nucleotide independence is assumed. Exceptional symmetry (symmetry beyond that expected under an independent nucleotide assumption) was proposed previously as a meaningful measure of the extension of the second parity rule to oligonucleotides. The global exceptional symmetry was detected in long and short genomes. RESULTS: To explore the exceptional genomic word symmetry along the genome sequences, we propose a sliding window method to extract the values of exceptional symmetry (for all words or by word groups). We compare the exceptional symmetry effect size distribution in all human chromosomes against control scenarios (positive and negative controls), testing the differences and performing a residual analysis. We explore local exceptional symmetry in equivalent composition word groups, and find that the behaviour of the local exceptional symmetry depends on the word group. CONCLUSIONS: We conclude that the exceptional symmetry is a local phenomenon in genome sequences, with distinct characteristics along the sequence of each chromosome. The local exceptional symmetry along the genomic sequences shows outlying segments, and those segments have high biological annotation density.

Subjects

Human Chromosomes/genetics , DNA/genetics , Human Genome , Genetic Models , Statistical Models , Genomics , Humans , Transcriptome

6.

Analysis of single-strand exceptional word symmetry in the human genome: new measures.

Afreixo, Vera; Rodrigues, João M O S; Bastos, Carlos A C.

Biostatistics ; 16(2): 209-21, 2015 Apr.

Article in English | MEDLINE | ID: mdl-25190514

ABSTRACT

Some previous studies suggest the extension of Chargaff's second rule (the phenomenon of symmetry in a single DNA strand) to long DNA words. However, in random sequences generated under an independent symbol model where complementary nucleotides have equal occurrence probabilities, we expect the phenomenon of symmetry to hold for any word length. In this work, we develop new statistical methods to measure the exceptional symmetry. Exceptional symmetry is a refinement of Chargaff's second parity rule that highlights the words whose frequency of occurrence is similar to that of its reversed complement but dissimilar to the frequencies of occurrence of other words which contain the same number of nucleotides A or T. We analyze words of lengths up to 12 in the complete human genome and in each chromosome separately. We assess exceptional symmetry globally, by word group, and by word. We conclude that the global symmetry present in the human genome is clearly exceptional and significant. The chromosomes present distinct exceptional symmetry profiles. There are several exceptional word groups and exceptional words with a strong exceptional symmetry.

Subjects

DNA/genetics , Human Genome/genetics , Genetic Models , Statistical Models , Humans

7.

Exceptional single strand DNA word symmetry: analysis of evolutionary potentialities.

Afreixo, Vera; Rodrigues, João M O S; Bastos, Carlos A C.

J Integr Bioinform ; 11(3): 250, 2014 Oct 23.

Article in English | MEDLINE | ID: mdl-25339084

ABSTRACT

Some previous studies point to the extension of Chargaff’s second rule (the phenomenon of symmetry) to words of large length. However, in random sequences generated by an independent symbol model where the probability of occurrence of complementary nucleotides is the same, we expect that the phenomenon of symmetry holds for all word lengths. In this work, we measure the symmetry above that expected in independence contexts (exceptional symmetry), for several organisms: viruses; archaea; bacteria; eukaryotes. We also create 27 control scenarios with the same length of each genome under study. The results for each organism were compared to those obtained in control scenarios. We created a new organism genomic signature consisting of a vector of the measures of exceptional symmetry for words of lengths 1 through 12. We show that the proposed signature is able to capture essential relationships between organisms.

Subjects

Base Sequence , Single-Stranded DNA/genetics , Molecular Evolution , Animals , Archaea/genetics , Bacteria/genetics , Genome , Humans , Phylogeny , Viruses/genetics

8.

XS: a FASTQ read simulator.

Pratas, Diogo; Pinho, Armando J; Rodrigues, João M O S.

BMC Res Notes ; 7: 40, 2014 Jan 16.

Article in English | MEDLINE | ID: mdl-24433564

ABSTRACT

BACKGROUND: The emerging next-generation sequencing (NGS) is bringing, besides the natural huge amounts of data, an avalanche of new specialized tools (for analysis, compression, alignment, among others) and large public and private network infrastructures. Therefore, a direct necessity of specific simulation tools for testing and benchmarking is rising, such as a flexible and portable FASTQ read simulator, without the need of a reference sequence, yet correctly prepared for producing approximately the same characteristics as real data. FINDINGS: We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores). CONCLUSIONS: XS provides an efficient and convenient method for fast simulation of FASTQ files, such as those from Ion Torrent (currently uncovered by other simulators), Roche-454, Illumina and ABI-SOLiD sequencing machines. This tool is publicly available at http://bioinformatics.ua.pt/software/xs/.
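To make the three FASTQ components named in this abstract concrete (header, DNA sequence, quality scores), here is a minimal sketch that emits one synthetic record without a reference sequence. It is not the XS tool and does not reproduce its options or statistical models; the function name, the header format, the uniform base and quality draws, and the Phred+33 score range 0 to 40 are all assumptions chosen for illustration.

```python
import random

def random_fastq_record(index, read_len, rng):
    """Build one four-line FASTQ record with random bases and qualities."""
    header = f"@sim_read_{index}"  # hypothetical header format
    bases = "".join(rng.choice("ACGT") for _ in range(read_len))
    # Assumption: Phred+33 encoding with scores drawn uniformly from 0..40.
    quals = "".join(chr(33 + rng.randint(0, 40)) for _ in range(read_len))
    return f"{header}\n{bases}\n+\n{quals}\n"

rng = random.Random(0)
print(random_fastq_record(1, 20, rng), end="")
```

A realistic simulator would model position-dependent quality and sequence complexity rather than drawing every symbol uniformly.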

Subjects

Computational Biology/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , DNA Sequence Analysis/methods , Algorithms , Reproducibility of Results , Software

9.

The breakdown of the word symmetry in the human genome.

Afreixo, Vera; Bastos, Carlos A C; Garcia, Sara P; Rodrigues, João M O S; Pinho, Armando J; Ferreira, Paulo J S G.

J Theor Biol ; 335: 153-9, 2013 Oct 21.

Article in English | MEDLINE | ID: mdl-23831271

ABSTRACT

Previous studies have suggested that Chargaff's second rule may hold for relatively long words (above 10 nucleotides), but this has not been conclusively shown. In particular, the following questions remain open: Is the phenomenon of symmetry statistically significant? If so, what is the word length above which significance is lost? Can deviations in symmetry due to the finite size of the data be identified? This work addresses these questions by studying word symmetries in the human genome, chromosomes and transcriptome. To rule out finite-length effects, the results are compared with those obtained from random control sequences built to satisfy Chargaff's second parity rule. We use several techniques to evaluate the phenomenon of symmetry, including Pearson's correlation coefficient, total variational distance, a novel word symmetry distance, as well as traditional and equivalence statistical tests. We conclude that word symmetries are statistically significant in the human genome for word lengths up to 6 nucleotides. For longer words, we present evidence that the phenomenon may not be as prevalent as previously thought.
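One of the techniques listed in this abstract, Pearson's correlation coefficient, can be applied to word symmetry by correlating the count of every word of length k with the count of its reversed complement. The sketch below shows that idea only; the function names are hypothetical, and the paper's other measures (total variational distance, the novel word symmetry distance, the equivalence tests) are not reproduced.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # undefined if either list is constant

def symmetry_correlation(seq, k):
    """Correlate counts of all length-k words with their reversed complements."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    x = [counts[w] for w in words]
    y = [counts[reverse_complement(w)] for w in words]
    return pearson(x, y)

s = "ACGGTCA"
print(symmetry_correlation(s + reverse_complement(s), 2))
```

A sequence concatenated with its own reversed complement is perfectly symmetric by construction, so it gives a correlation of 1 up to rounding; real sequences are judged against random controls, as the abstract describes.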

Subjects

Human Chromosomes/genetics , Human Genome/physiology , Genetic Models , Human Chromosomes/metabolism , Humans , Transcriptome/physiology

10.

Inter-dinucleotide distances in the human genome: an analysis of the whole-genome and protein-coding distributions.

Bastos, Carlos A C; Afreixo, Vera; Pinho, Armando J; Garcia, Sara P; Rodrigues, João M O S; Ferreira, Paulo J S G.

J Integr Bioinform ; 8(3): 172, 2011 Sep 15.

Article in English | MEDLINE | ID: mdl-21926435

ABSTRACT

We study the inter-dinucleotide distance distributions in the human genome, both in the whole-genome and protein-coding regions. The inter-dinucleotide distance is defined as the distance to the next occurrence of the same dinucleotide. We consider the 16 sequences of inter-dinucleotide distances and two reading frames. Our results show a period-3 oscillation in the protein-coding inter-dinucleotide distance distributions that is absent from the whole-genome distributions. We also compare the distance distribution of each dinucleotide to a reference distribution, that of a random sequence generated with the same dinucleotide abundances, revealing the CG dinucleotide as the one with the highest cumulative relative error for the first 60 distances. Moreover, the distance distribution of each dinucleotide is compared to the distance distribution of all other dinucleotides using the Kullback-Leibler divergence. We find that the distance distribution of a dinucleotide and that of its reversed complement are very similar, hence, the divergence between them is very small. This is an interesting finding that may give evidence of a stronger parity rule than Chargaff's second parity rule.
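The inter-dinucleotide distance defined in this abstract (distance to the next occurrence of the same dinucleotide) can be collected in a single pass, as in the sketch below. The function name is hypothetical, and the scan advances one nucleotide at a time over overlapping dinucleotides, which is an assumption; the reading-frame handling described in the paper is not reproduced here.

```python
from collections import defaultdict, Counter

def inter_dinucleotide_distances(seq):
    """For each dinucleotide, histogram the distances between
    consecutive occurrences (start position to start position)."""
    last_seen = {}
    hist = defaultdict(Counter)
    for i in range(len(seq) - 1):
        di = seq[i:i + 2]
        if di in last_seen:
            hist[di][i - last_seen[di]] += 1
        last_seen[di] = i
    return hist

for di, distances in inter_dinucleotide_distances("CGACGCG").items():
    print(di, dict(distances))
```

Normalising each histogram to relative frequencies gives the distributions that can then be compared, for example with the Kullback-Leibler divergence mentioned in the abstract.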

Subjects

Genetic Variation/physiology , Human Genome/physiology , Reading Frames/physiology , DNA Sequence Analysis/methods , Animals , Humans

11.

Minimal absent words in prokaryotic and eukaryotic genomes.

Garcia, Sara P; Pinho, Armando J; Rodrigues, João M O S; Bastos, Carlos A C; Ferreira, Paulo J S G.

PLoS One ; 6(1): e16065, 2011 Jan 31.

Article in English | MEDLINE | ID: mdl-21386877

ABSTRACT

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We investigate whether the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, and in other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry from the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.

Subjects

Eukaryotic Cells/metabolism , Genome/genetics , Prokaryotic Cells/metabolism , Animals , Base Composition/genetics , Base Sequence , Inheritance Patterns/genetics , Molecular Sequence Data , Nucleotides/genetics , Phylogeny , Vertebrates/genetics

12.

On finding minimal absent words.

Pinho, Armando J; Ferreira, Paulo J S G; Garcia, Sara P; Rodrigues, João M O S.

BMC Bioinformatics ; 10: 137, 2009 May 08.

Article in English | MEDLINE | ID: mdl-19426495

ABSTRACT

BACKGROUND: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones. RESULTS: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws. CONCLUSION: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.
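The definition of a minimal absent word given in this abstract (an absent word that stops being absent when its leftmost or its rightmost character is removed) can be checked directly by brute force on small inputs, as sketched below. This is not the authors' approximately linear-time algorithm; it is a quadratic illustration, the function names are hypothetical, and it starts at length 2 by assumption to avoid the convention question of single absent symbols.

```python
def substrings(seq, n):
    """Set of all length-n substrings present in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def minimal_absent_words(seq, max_len):
    """Brute-force minimal absent words of length 2..max_len."""
    found = set()
    for n in range(2, max_len + 1):
        present_n = substrings(seq, n)
        present_prev = substrings(seq, n - 1)
        for left in present_prev:        # candidate w[:-1], must be present
            for right in present_prev:   # candidate w[1:], must be present
                if left[1:] == right[:-1]:
                    word = left + right[-1]
                    if word not in present_n:
                        found.add(word)
    return found

print(sorted(minimal_absent_words("ACA", 3)))
```

In the sequence ACA, for example, CAC is absent while both CA and AC occur, so CAC is a minimal absent word.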

Subjects

Algorithms , Base Sequence , DNA/chemistry , Genomics/methods , DNA Sequence Analysis/methods , Nucleic Acid Databases
